MP02 - Exploring U.S. Housing Affordability and YIMBY Cities

Author

Tova Hirschhorn

Published

October 31, 2025

I. Introduction

Housing affordability remains one of the most urgent challenges in the United States. In fact, it has become a central topic in the upcoming New York City mayoral election. On November 4, 2025, voters will not only weigh three ballot proposals focused on housing affordability, but they will also choose the candidate whose plan they believe best addresses the city’s housing crisis.

Mini-Project 02 explores how data analytics can identify America’s most “YIMBY” (Yes In My Backyard) cities, which enable new housing development through permissive zoning and pro-growth policies. The analysis combines data from the U.S. Census and Bureau of Labor Statistics and integrates demographic, economic, and housing construction indicators across Core-Based Statistical Areas (CBSAs).

Data was pulled and merged using tidycensus and dplyr, and visualized with ggplot2 to highlight regional trends on these topics. Ultimately, the goal is translate these insights into a policy brief advocating for a federal YIMBY incentive program.

II. ERD Structure and Relationships

This Entity-Relationship Diagram (ERD) presents the seven tables used in this analysis: INCOME, RENT, POPULATION, HOUSEHOLDS, PERMITS, WAGES. The ERD illustrates how these datasets are structured and connected through shared keys, enabling seamless integration. By combining these sources, we can analyze demographic, economic, and housing trends across regions and over time.

III. Initial Data Analysis

Code
# Task 2 Q1 - Determining which CBSA permitted the largest number of new housing units between 2010-2019

#Filtering permits for period 2010-2019
permits_2010_2019 <- PERMITS |>
  filter(year >= 2010 & year <= 2019)

#Determine the total number of permits by CBSA
total_permits_CBSA <- permits_2010_2019 |>
  group_by(CBSA)|>
  summarize(total_permits = sum(new_housing_units_permitted, na.rm = TRUE))

#total_permits_CBSA

#Join INCOME table to find CBSA names
total_permits_CBSA<- total_permits_CBSA|>
  left_join(INCOME |> select(GEOID, NAME) |> distinct (),
            by = c("CBSA" = "GEOID"))

#Determine the CBSA with the highest total permits
highest_CBSA <- total_permits_CBSA |>
  arrange(desc(total_permits))|>
  slice(1)

highest_CBSA_name <-highest_CBSA[["NAME"]]

Q1. Between 2010-2019, Houston-Sugar Land-Baytown, TX Metro Area was the CBSA that issued the most housing permits. Today, it is the fifth largest metro area in the United States, and considered a global hub for innovation, with industries focused on aerospace, IT, life sciences and energy.

Code
#Task 2 Q2 - Determining in what year Albuquerque, NM (CBSA 10740) permit the most housing units

#Determining the number of permits in Albuquerque, specific to CBSA 10740. Excluding data of 2020 to reflect COVID-19 events
permits_albuquerque <- PERMITS |>
  filter(CBSA == 10740) |>
  filter(year != 2020)|>
  arrange(desc(new_housing_units_permitted))

#Determining the year and number of permits for inline coding
max_row <- permits_albuquerque |>
  slice_max(new_housing_units_permitted, n=1)

max_year <- max_row[['year']]
max_permits <- max_row[["new_housing_units_permitted"]]

#Renaming the column to create a table that illustrates my findings
permits_albuquerque_renamed <- permits_albuquerque |>
  rename(`New Housing Units Permitted` = new_housing_units_permitted)

Q2. New housing in Albuquerque, NM peaked in 2021, with a total of 4021 permits issues. It is the highest number issued in any single year within this CBSA, as seen in the table below.

Code
#Task 2 Q3 - Determining which STATE has the highest average INDIVIDUAL income in 2015

#Filter the data from 2015 from tables INCOME, HOUSEHOLDS and POPULATION
income_2015 <- INCOME |>filter(year == 2015)
households_2015 <-HOUSEHOLDS|>filter(year == 2015)
population_2015 <-POPULATION|>filter(year == 2015)

#Combines all three datasets using GEOID and year
combined_2015 <- income_2015|>
  left_join(households_2015, by = c("GEOID","year"))|>
  left_join(population_2015, by = c("GEOID","year"), suffix = c("_inc", "_pop"))

#Calculate the total income per CBSA (average HH income*number of HH)
combined_2015 <- combined_2015 |>
  mutate(total_income = household_income * households)
#view(combined_2015)

#Extract the state abbreviation
combined_2015 <- combined_2015 |>
  mutate(state = str_extract(NAME, ", (.{2})", group=1))
#view(combined_2015)

#Determine average individual income per state
state_2015 <- combined_2015 |>
  group_by(state)|>
  summarise(
    total_income=sum(total_income, na.rm = TRUE),
    total_population = sum(population, na.rm = TRUE)
  ) |>
  ungroup()|>
  mutate(avg_ind_income= total_income/total_population)


#Select top state and format table for display
top_state_table <- state_2015|>
  arrange(desc(avg_ind_income)) |>
  slice(1) |>
  mutate (
    `Total Income` = scales::dollar(total_income, accuracy = 1),
    `Total Population` = scales::comma(total_population, accuracy = 1),
    `Average Individual Income` = scales::dollar(avg_ind_income, accuracy = 1)
  ) |>
  select(
    State = state,
    `Total Income`,
    `Total Population`,
    `Average Individual Income`
  )

Q3. In 2015, the state with the highest average individual income was DC, with an average income of $33,233 per person. Washington, DC consistently ranks as one of the highest median household incomes in the country.

Code
#Task 2 Q3 - Create display table
datatable(
  data = top_state_table,
  options = list(
    dom = 't',
    paging = FALSE,
    columnDefs = list(list(className = 'dt-center', targets = "_all"))
  ),
  class = "compact stripe"
)
Code
#Task 2 Q4 - Determining the last year NYC CBSA (code 5182) had the most data scientists in the country.

#Filter for data scientists (5182) in the WAGES table
data_scientists <- WAGES |>
  filter(INDUSTRY == 5182)

# Standardize CBSA codes to join with POPULATION table
t_wages <- data_scientists |> 
  mutate(std_cbsa = paste0(FIPS, "0"))

t_popultation <- POPULATION |> 
  mutate(std_cbsa = paste0("C", GEOID))

# Join the two tables on standardized CBSA code
data_scientists_cbsa <- inner_join(t_wages, t_popultation, join_by(std_cbsa == std_cbsa), relationship = "many-to-many")

#Filter for CBSA name containing "New York" for the NYC area
nyc_data_scientists <- data_scientists_cbsa |>
  filter(str_detect(NAME, "New York"))

#Find the CBSA with the most data scientists each year
highest_cbsa_per_year <- data_scientists_cbsa |>
  group_by(YEAR)|>
  slice_max(EMPLOYMENT, n=1)|>
  ungroup()
  
#Finding all the years that NYC was at the top
nyc_top_years <- highest_cbsa_per_year|>
  filter(str_detect(NAME, "New York"))

#Find the last year that NY had the most data scientists
last_year_nyc_top <- nyc_top_years |>
  summarise(Last_Year = max(YEAR), .groups = "drop")

#for inline coding
last_year_nyc<-last_year_nyc_top$Last_Year

Q4.Despite being home to many major companies with data science roles, New York City last had the highest number of data scientists in 2015. It remains a major hub for data science, but has since experienced a shortage of talent due to increasing demand in STEM fields.

Code
#Task 2 (5) Determine what fraction of total wages in the NYC CBSA was earned by people employed in the finance and insurance industries (NAICS code 52)? In what year did this fraction peak?

#Filter the NYC CBSA rows
nyc_wages <- WAGES |>
  filter(str_starts(FIPS, "C3562"))

#Summarize total wages for NAICS 52 (finance and insurance industries) per year
nyc_wages_summary <- nyc_wages |>
  group_by(YEAR) |>
  summarise(
    total_wages = sum(TOTAL_WAGES, na.rm = TRUE),
    finance_wages = sum(TOTAL_WAGES 
                        [str_starts(as.character(INDUSTRY), "52")], na.rm = TRUE))|>
  mutate(fraction_finance = finance_wages/total_wages)

#Determine the year this fraction peaked
peak_year <- nyc_wages_summary |>
  arrange(desc(fraction_finance))|>
  slice(1)

#for inline coding
peak_year_value <- peak_year$YEAR
peak_fraction <-scales::percent(peak_year$fraction_finance, accuracy = 0.1)

Q5. New York City is similarly widely recognized as the financial capital of the world. Employment in the finance and insurance industries peaked in 2021, accounting for 15.9% of total wages. This underscores their role as major contributors to the GDP and a vital and powerful component to both the city’s and state’s economies.

V. Building Metrics of Housing Affordability and Housing Growth

Rent Burden

We measured the rent burden to calculate the rent-to-income ratio and then standardize it so values could be compared across years and metro areas. The earliest year in the dataset served as the baseline and the metric is scaled from 0-100, where 50 represents the national average. Higher scores indicate higher rent burden, whereas lower ones indicate affordability.

In Chicago, the rent burden fluctuates slightly between 2009 and 2023, showing periods of rising and stabilizing housing costs relative to income. While the city does not rank among the most extreme cases nationally, the figures in the table below suggest that affordability challenges exist. Overall, Chicago’s trend aligns with the patterns observed in large urban housing market, where increases in rent outpace income, and therefore leads to pressure on affordability.

Code
#Table for single Metro Area over time
#Filtering for Chicago Metro Area
chicago_rent <- rent_income_pop|>
  filter(str_detect(NAME, "Chicago"))

#Formatting data for table
chicago_rent_clean <- chicago_rent |>
  select(NAME, year, rent_burden_ratio, rent_burden_scaled) |>
  rename(
    "Metro Area" = NAME,
    "Year" = year,
    "Rent Burden (%)" = rent_burden_ratio,
    "Scaled Rent Burden (0-100)" = rent_burden_scaled
  ) |>
  mutate(
    `Rent Burden (%)` = round(`Rent Burden (%)` * 100, 2),
    `Scaled Rent Burden (0-100)` = round(`Scaled Rent Burden (0-100)`, 2)
  )

# Display table
datatable(
  chicago_rent_clean,
 options = list(
    pageLength = 5,
    searching=FALSE,
    columnDefs = list(list(className = 'dt-center', targets = "_all"))
  ),
  caption = "Chicago Metro Area Rent Burden Over Time",
  class = "compact stripe",
  rownames = FALSE
)



The tables below show the metropolitan areas with the highest and lowest average rent burdens.

Puerto Rico stands out with three of the highest rent burdens, reflecting a tight housing market, high demand, or limited housing supply. These factors very likely place strong financial pressure on residents living on this small island.

Code
#Highlighting the Metro Areas with the highest and lowest rend burden
#Filtering 3 metro areas with highest average rent burden
highest_rent <- rent_income_pop |>
  group_by(NAME) |>
  summarise(avg_rent_burden = mean(rent_burden_ratio, na.rm = TRUE)) |>
  arrange(desc(avg_rent_burden)) |>
  slice_head(n = 3) |>
  rename(
    "Metro Area" = NAME,
    "Average Rent Burden (%)" = avg_rent_burden
  ) |>
  mutate(`Average Rent Burden (%)` = round(`Average Rent Burden (%)` * 100, 2))

# Display table for highest rent burden
datatable(
  highest_rent,
  options = list(
    pageLength = 3,
    searching = FALSE,
    lengthChange = FALSE,
    info = FALSE,
    paging=FALSE,
    columnDefs = list(list(className = 'dt-center', targets = "_all"))
  ),
  caption = "3 Metro Areas with Highest Average Rent Burden",
  class = "compact stripe",
  rownames = FALSE
)


In contrast, the metro areas in New Hampshire, Missouri and Iowa have the lowest average rent burdens, indicating more affordable housing relative to a resident’s income. These regions benefit from lower housing demand, more available units, or even slower population growths, which eases financial pressure on households in comparison with high-burden regions.

Code
#Filtering 3 metro areas with lowest average rent burden
lowest_rent <- rent_income_pop |>
  group_by(NAME) |>
  summarise(avg_rent_burden = mean(rent_burden_ratio, na.rm = TRUE)) |>
  arrange(avg_rent_burden) |>
  slice_head(n = 3) |>
  rename(
    "Metro Area" = NAME,
    "Average Rent Burden (%)" = avg_rent_burden
  ) |>
  mutate(`Average Rent Burden (%)` = round(`Average Rent Burden (%)` * 100, 2))


## Display table for lowest rent burden
datatable(
  lowest_rent,
  options = list(
    pageLength = 3,
    searching = FALSE,
    lengthChange = FALSE,
    info = FALSE,
      paging=FALSE,
    columnDefs = list(list(className = 'dt-center', targets = "_all"))
  ),
  caption = "3 Metro Areas with Lowest Average Rent Burden",
  class = "compact stripe",
  rownames = FALSE
)

Housing Growth

This table shows the five CBSAs with the highest housing growth potential. Positive instantaneous growth indicates that these cities are permitting a significant number of new housing relative to their current population. Meanwhile, negative rate-based growth highlights areas where housing construction is outpacing population growth.

Code
#Cleaning and creating the table for top cities
top_cities_clean <- top_cities |>
  rename(
    "Metro Area" = NAME,
    "Instantaneous Growth (%)" = avg_instantaneous,
    "Rate-Based Growth (%)" = avg_rate_based,
    "Composite Score" = avg_composite
  ) |>
  mutate(
    `Instantaneous Growth (%)` = round(`Instantaneous Growth (%)` * 100, 2),
    `Rate-Based Growth (%)` = round(`Rate-Based Growth (%)` * 100, 2),
    `Composite Score` = round(`Composite Score`, 2)
  )

#Displaying table
datatable(
  top_cities_clean,
  options = list(
    pageLength = 5,
    searching = FALSE,
    lengthChange = FALSE,
    info = FALSE,
    paging=FALSE,
    columnDefs = list(list(className = 'dt-center', targets = "_all"))
  ),
  caption = "Top 5 Metropolitan Areas by Housing Growth Potential (Standardized Metrics)",
  class = "compact stripe",
  rownames = FALSE
)


The five CBSAs listed in the table all exhibit negative instantaneous growth, which indicates that the number of new housing units being permitted is declining relative to the existing population. The absence of data for rate-based growth suggests that these metro areas may be experiencing very low or negative population growth. As a result, the composite score cannot be calculated for these locations.

Code
#Formatting the table for bottom cities
bottom_cities_clean <- bottom_cities |>
  rename(
    "Metro Area" = NAME,
    "Instantaneous Growth (%)" = avg_instantaneous,
    "Rate-Based Growth (%)" = avg_rate_based,
    "Composite Score" = avg_composite
  ) |>
  mutate(
    `Instantaneous Growth (%)` = round(`Instantaneous Growth (%)` * 100, 2),
    `Rate-Based Growth (%)` = ifelse(is.na(`Rate-Based Growth (%)`), "N/A", round(`Rate-Based Growth (%)` * 100, 2)),
    `Composite Score` = ifelse(is.na(`Composite Score`), "N/A", round(`Composite Score`, 2))
  )

#Creating table bottom cities
datatable(
  bottom_cities_clean,
  options = list(
    pageLength = 5,
    searching = FALSE,
    lengthChange = FALSE,
    info = FALSE,
    paging=FALSE,
    columnDefs = list(list(className = 'dt-center', targets = "_all"))
  ),
  caption = "Bottom 5 Metropolitan Areas by Housing Growth Potential (Standardized Metrics)",
  class = "compact stripe",
  rownames = FALSE
)

VI. Visualizing Rent Burden and Housing Growth

The scatterplot depicts the relationship between early rent burden and average composite housing growth across CBSAs. Each point represents a metro area, with red points highlighting potential YIMBY candidates.

Code
#Task 6 - Create 2 visualizations to investigate relationship between Rent Burden and Housing Growth metrics.

#Task 6(i) - Creating dataset that combines rent burden metrics with housing growth metrics for each CBSA over time
#Rent burden summary
rent_summary <- rent_income_pop |>
  group_by(GEOID, NAME) |>
  summarise(
    rent_early = rent_burden_scaled[year == min(year)],
    rent_change = last(rent_burden_scaled) - first(rent_burden_scaled),
    pop_change = last(population) - first(population),
    .groups = "drop"
  )

#Housing growth summary
housing_summary <- pop_permits |>
  group_by(GEOID, NAME) |>
  summarise(
    avg_composite_growth = mean(composite_growth_index, na.rm = TRUE),
    .groups = "drop"
  )

#Combing rent burden and housing growth
yimby_data <- rent_summary |>
  inner_join(housing_summary, by = c("GEOID", "NAME"))

#Identifying YIMBY potential CBSAs
yimby_candidates <- yimby_data |>
  filter(
    rent_early > median(rent_early, na.rm = TRUE),
    rent_change < 0,
    pop_change > 0,
    avg_composite_growth > median(avg_composite_growth, na.rm = TRUE)
  )

#Visualization 1 - Early Rent Burden vs. Average Composite Housing Growth 
p <- ggplot(yimby_data, aes(x = rent_early, y = avg_composite_growth, text = NAME)) +
  geom_point(alpha = 0.5) +
  geom_point(data = yimby_candidates, color = "red", size = 2, aes(text = NAME)) +
  labs(
    x = "Early Rent Burden (Scaled 0-100)",
    y = "Average Composite Housing Growth",
    title = "Early Rent Burden & Housing Growth: Highlighting Potential YIMBY CBSAs",
    subtitle = "Red points indicate high early rent burden and above-average housing growth"
  ) +
  theme_minimal() +
  theme(
    text = element_text(family = "Arial"),
    plot.title = element_text(face = "bold", size = 12, hjust = 0.5),
    plot.subtitle = element_text(size = 9, color = "gray40", hjust = 0.5, margin = margin(b = 10)),
    axis.title = element_text(size = 9),
    axis.text = element_text(size = 8)
  )

# Convert to interactive plotly plot
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot


The top five YIMBY cities were highlighted to examine their evolution over time. While they started with relatively high rent burdent, these values have decreased throughout the study period. This indicates that these cities have successfully expanded housing supply in response to affordability pressures. The policies or strategies implemented in these areas may offer valuable lessons for other metropolitan areas facing similar housing challenges.

Code
#Task 6 - Create 2 visualizations to investigate relationship between Rent Burden and Housing Growth metrics.
#Task 6(ii) 
#Filtering for top 5 YIMBY CBSAs
top_yimbys <- yimby_candidates |> 
  slice_max(avg_composite_growth, n = 5)

#Creating plot
ggplot(
  rent_income_pop |> filter(NAME %in% top_yimbys$NAME),
  aes(x = year, y = rent_burden_scaled, color = NAME)
) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(
    title = "Rent Burden Over Time for Top 5 YIMBY CBSAs",
    subtitle = "These CBSAs show high early rent burden, decreasing trends, population growth, and above-average housing growth",
    x = "Year",
    y = "Rent Burden (Scaled 0–100)",
    color = "CBSA"
  ) +
  theme_minimal() +
  theme(
    text = element_text(family = "Arial"),
    plot.title = element_text(face = "bold", size = 12, hjust = 0.5),
    plot.subtitle = element_text(size = 9, color = "gray40", hjust = 0.5),
    axis.title = element_text(size = 9),
    axis.text = element_text(size = 8),
    legend.position = "bottom",
    legend.box = "vertical",         # stack items vertically
    legend.text = element_text(size = 8),
    legend.title = element_text(size = 9)
  ) +
  guides(color = guide_legend(ncol = 1)) 

VII. Policy Brief

Policy Purpose:
To encourage local municipalities to adopt YIMBY-friendly housing policies that will expand housing supply, reduce rent burden, and promote sustainable urban growth.

Primary Sponsor: Representative from The Villages, FL Metro Area
Co-Sponsor: Representative from Buffalo-Cheektowaga-Niagara Falls, NY

Target Stakeholders:
  • Housing and Construction: The Villages, FL Metro Area is generally a large retirement community but has experienced rapid housing expansion amid its growing population. Local builders, contractors, and suppliers would directly benefit from YIMBY-friendly policies that streamline sustained growth.
  • Manufacturing: The Buffalo-Cheektowaga-Niagara Falls, NY Metro Area has a diverse manufacturing foundation, including tech-driven production, automotive manufacturing, and medical device industries. Expanding affordable housing options would help attract and retain skilled workers in these sectors, strengthening the regional labor market and economic competitiveness.
Affordability by the Numbers:
  • Rent Burden: Measures how much a typical household’s income goes toward rent. A decrease in rent burden over time indicates effective improvements in affordability.
  • Housing Growth Metrics:
    • Instantaneous Growth: Measures new housing permits relative to current population.
    • Rate-Based Growth: Measures housing permits relative to 5-year population growth.

Policy Impact: This proposed initiative would allocate federal resources to municipalities that demonstrate measurable progress in reducing rent burdens and expanding housing supply. The Villages, FL, stands out as a model YIMBY success, serving as a benchmark for communities such as the Buffalo-Cheektowaga-Niagara Falls region, which could benefit from targeted assistance to overcome market and regulatory constraints. By aligning housing incentives with measurable affordability outcomes, Congress can foster a balanced approach to urban and suburban development, ultimately strengthening regional economies, enhancing workforce stability, and advancing the national goal of equitable housing access for all Americans.